{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS 237 Spring 2021, HW 10\n", "\n", "#### Due date: Thursday April 15th at Midnight (1 minute after 11:59pm on 4/15) via Gradescope (with a 6 hour grace period)\n", "\n", " Late policy: You may submit the homework up to 24 hours late for a 10% penalty. Hence, the late deadline is Friday 4/16 at Midnight (with a 6 hour grace period). \n", "\n", "#### General Instructions\n", "\n", "Please complete this notebook by filling in solutions where indicated. \n", "\n", "For full credit, please take careful note of the following requirements:\n", "\n", "- Do NOT use any HTML tags in your notebook, as Gradescope will ignore them;\n", "\n", "- Do NOT answer questions by including images, as Gradescope will ignore them; and \n", "\n", "- You MUST \"Restart and Run All\" from the Kernel menu before submitting to Gradescope.\n", "\n", "**Any assignment that does not follow these requirements will not receive full credit.** \n", "\n", "\n", "\n", "There are 8 problems on this homework (each worth 7.5 points), 5 analytical and 3 concerning Pandas, a library for managing data files. Pandas is used extensively in data science as part of the \"data wrangling\" phase of a machine learning project. \n", "\n", "The problems for Lab 10 are problems 1 through 3, and the remaining problems are the analytical problems. 
\n", "\n", "NOTE (added Sunday am): I have cancelled problems 8 and 10; including them was an editing mistake.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Here are some imports which will be used in code that we write for CS 237\n", "\n", "import matplotlib.pyplot as plt # normal plotting\n", "import numpy as np\n", "\n", "from math import log, pi, floor, ceil, sqrt # import whatever you want from math\n", "from random import seed, random\n", "from collections import Counter\n", "\n", "%matplotlib inline\n", "\n", "from scipy.special import comb\n", " \n", "def C(N,K): \n", " return comb(N,K,exact=True) # just a wrapper around the scipy function; exact=True returns an integer\n", "\n", "\n", "# Here are the basic statistical functions we will use from numpy\n", "\n", "from numpy import mean, var, std, median\n", "\n", "L = [2,4,3,6,4,5]\n", "\n", "# mean value\n", "\n", "mean(L) \n", "\n", "\n", "# Variance\n", "# ddof = delta degrees of freedom, default is 0\n", "\n", "# population variance\n", "var(L) \n", "\n", "# sample variance\n", "var(L,ddof=1)\n", "\n", "# Standard deviation\n", "# ddof = delta degrees of freedom, default is 0\n", "\n", "# population standard deviation\n", "std(L) \n", "\n", "# sample standard deviation\n", "std(L,ddof=1) \n", "\n", "# Median\n", "\n", "median(L) \n", "\n", "# Random sampling of `size` elements from list with or without replacement\n", "\n", "np.random.choice(L,size=1,replace=True)\n", " \n", "# Scipy statistical functions\n", "\n", "from scipy.stats import norm, binom, expon, geom, poisson, gamma, nbinom, bernoulli \n", "\n", "# https://docs.scipy.org/doc/scipy/reference/stats.html\n", "\n", "#### Normal Distribution #####\n", "\n", "###### Note that in this library loc = mean and scale = standard deviation #####\n", "\n", "# Examples assume random variable X (e.g., housing prices) normally distributed with mu = 60, sigma = 
10\n", "\n", "# Probability Density Function (really only useful for drawing the curve)\n", "# f(x) = height of the density curve at x (not a probability: P(X == x) = 0 for continuous X)\n", "\n", "norm.pdf(x=50,loc=60, scale= 10) \n", "\n", "# Cumulative Distribution Function\n", "# F(x) = P(X < x)\n", "\n", "# Example: Percentage of houses less than 50K. \n", "norm.cdf(x=50,loc=60,scale=10) \n", "\n", "# Interval probabilities follow from the CDF: P(a < X < b) = F(b) - F(a)\n", "\n", "# Survival Function\n", "# 1 - F(x) = P(X > x)\n", "\n", "# Example: Percentage of houses more than 50K.\n", "norm.sf(x=50,loc=60,scale=10) \n", "\n", "# Percentage Point Function: Inverse of the CDF:\n", "# What is the value of k for which P( X < k ) = q ?\n", "\n", "# Example: What is the maximum cost of the 5% cheapest houses, \n", "# i.e., the x such that P(X < x) = 0.05?\n", "\n", "norm.ppf(q=0.05,loc=60,scale=10)\n", "\n", "# Inverse Survival Function: Inverse of (1 - CDF):\n", "# What is the value of k for which P( X > k ) = q ?\n", "\n", "# Example: What is the minimum cost of the 5% most expensive houses, \n", "# i.e., the x such that P(X > x) = 0.05?\n", "\n", "norm.isf(q=0.05,loc=60,scale=10)\n", "\n", "# Give the endpoints of the interval (centered on the mean)\n", "# which contains a fraction alpha of the population (alpha is a probability)\n", "# The probability is passed positionally, since newer SciPy versions renamed the alpha keyword to confidence\n", "\n", "# Ex. Give the interval for the middle 75% of the houses\n", "\n", "norm.interval(0.75, loc=60, scale=10)\n", "\n", "# generate a random variate\n", "norm.rvs(loc=60, scale=10)\n", "\n", "# generate random variates, returns an array of length = size\n", "norm.rvs(loc=60, scale=10, size=10)\n", "\n", "\n", "\n", "\n", "##### Bernoulli Distribution X ~ Bernoulli(p) ####\n", "\n", "# p = probability of success for Bernoulli trial\n", "\n", "# Generate a random variate\n", "bernoulli.rvs(p=0.5)\n", "\n", "# Generate an array of random variates\n", "bernoulli.rvs(p=0.5,size=100)\n", "\n", "##### Binomial Distribution X ~ B(n,p) ####\n", "\n", "# n = number of independent Bernoulli trials\n", "# p = probability of success for Bernoulli trial\n", "# k = outcome in range [0 .. 
n]\n", "\n", "# Generate a random variate\n", "binom.rvs(n=10, p=0.5)\n", "\n", "# Generate a list of random variates\n", "binom.rvs(n=10, p=0.5,size=100)\n", "\n", "# Probability mass function.\n", "binom.pmf(k=4, n=10, p=0.5)\n", "\n", "# Cumulative distribution function\n", "binom.cdf(k=4, n=10, p=0.5)\n", "\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas Data Management and Analysis Library\n", "\n", "In this lab, we will learn about the Pandas library for data analysis. Pandas is part of the Anaconda distribution, and there are decent tutorials on most aspects of Pandas; I would recommend the following:\n", "\n", "Pandas Tutorial: http://pandas.pydata.org/pandas-docs/stable/tutorials.html\n", "\n", "Basic functionality: http://pandas.pydata.org/pandas-docs/stable/basics.html\n", "\n", "Indexing and selecting data: http://pandas.pydata.org/pandas-docs/stable/indexing.html\n", "\n", "There are three problems, the first an extended tutorial, and the last two actual problems. You should read through problem 1 carefully and try all the examples. Problems 2 and 3 are more realistic activities based on the material in problem 1.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem One: Basic Pandas data manipulation\n", "\n", "We will first learn how to read data sets from a text or CSV (\"Comma Separated Values\") file, understand the DataFrame data structure, and learn how to extract data from a dataframe; then we will understand how to select rows and columns from a table, and to apply various functions to tables; finally, we will learn how to display histograms of the data. There is a LOT of complexity in all these various aspects of Pandas, but we will learn a \"novice subset\" of the most important features.\n", "\n", "Note: There is a lot of reading and thinking to do in this first problem; when we want you to do something on your own computer, we will indicate it with a **TODO**. 
All you really have to do for this problem is to try various things\n", "in Pandas. Do not skip ahead, however; you need to practice these before going on!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data as Tables\n", "\n", "The basic form of the data sets manipulated in Pandas (and indeed in all modern database systems) is a 2D table of data with rows and columns; here is a data set we will use as an example:" ] }, { "attachments": { "Screen%20Shot%202021-04-08%20at%2011.39.21%20PM.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "![Screen%20Shot%202021-04-08%20at%2011.39.21%20PM.png](attachment:Screen%20Shot%202021-04-08%20at%2011.39.21%20PM.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a hypothetical list of 10 students at BU; each row is an individual student, and each column gives a specific piece of information about that student. Note that each column contains the same kind of data (i.e., the first three columns have strings, and the remaining are floats--we will assume that all numeric data is represented as floating-point), and each column has a header giving a description of the information in that column. Column headers are not absolutely necessary, but we will make this assumption for now.\n", "\n", "**Note on terminology**: In Pandas, a table is called a **dataframe**; in databases we often call the rows in a table records, the columns fields, and the headers field names. This terminology is sometimes used in data analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comma Separated Value (CSV) Files\n", "\n", "A common file format for data stored as text is a CSV file, in which each row is stored on a separate line, with commas between all the fields. 
The table above would look like this if I opened it with Emacs on my Mac (there would be a newline \\n at the end of each line; on a Windows machine there would be \\r\\n):" ] }, { "attachments": { "Screen%20Shot%202021-04-08%20at%2011.34.40%20PM.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "![Screen%20Shot%202021-04-08%20at%2011.34.40%20PM.png](attachment:Screen%20Shot%202021-04-08%20at%2011.34.40%20PM.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a file format supported by Excel as well, and you can import it into an Excel spreadsheet or save a spreadsheet as a CSV file:" ] }, { "attachments": { "Screen%20Shot%202021-04-08%20at%2011.33.56%20PM.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "![Screen%20Shot%202021-04-08%20at%2011.33.56%20PM.png](attachment:Screen%20Shot%202021-04-08%20at%2011.33.56%20PM.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to create such a file, the easiest way is probably to start with an Excel spreadsheet and simply save it in .csv format!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading and Writing CSV Files in Pandas\n", "\n", "To manipulate such data tables in Python, the most widely used library is Pandas, which you should import as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # pd is the conventional alias for Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example of reading the CSV file `lab10.students.csv` from the 237 data repository using a URL\n", "and storing the data (called a \"dataframe\") in a variable `students`. \n", "\n", "You can also read a file from your current local directory (which we demonstrate a few cells below). 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Userid Gender ClassYear GPA Credits Transfer AP\n", "0 biden M U2 4.000 33 0.0 12\n", "1 harris F U4 3.630 45 71.1 8\n", "2 blinken M U4 3.380 113 0.0 12\n", "3 emhoff M U2 3.210 28 0.0 0\n", "4 yellen F U2 3.810 33 0.0 4\n", "5 austin M U3 3.034 56 28.0 0\n", "6 garland M U4 3.200 80 44.0 24\n", "7 haaland F U4 3.580 66 30.0 0\n", "8 trump M U3 1.230 32 0.0 0\n", "9 pence M U4 2.040 106 3.0 24" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students = pd.read_csv(\"https://cs-web.bu.edu/fac/snyder/cs237/Data/lab10.students.csv\")\n", "students" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you ask for the value of the dataframe to be printed out, it will print it out in ASCII:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Userid Gender ClassYear GPA Credits Transfer AP\n", "0 biden M U2 4.000 33 0.0 12\n", "1 harris F U4 3.630 45 71.1 8\n", "2 blinken M U4 3.380 113 0.0 12\n", "3 emhoff M U2 3.210 28 0.0 0\n", "4 yellen F U2 3.810 33 0.0 4\n", "5 austin M U3 3.034 56 28.0 0\n", "6 garland M U4 3.200 80 44.0 24\n", "7 haaland F U4 3.580 66 30.0 0\n", "8 trump M U3 1.230 32 0.0 0\n", "9 pence M U4 2.040 106 3.0 24\n" ] } ], "source": [ "print(students)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although we won't need to do this right now, for reference, you can also write out a dataframe to a csv file; the default is to write out the index numbers on each row--in general you want to avoid this. 
The following command will create an identical file to the one read in, without the pesky index numbers:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "students.to_csv('temp.csv', encoding='utf-8', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading in a local file is the same as reading in a file from a URL:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Userid Gender ClassYear GPA Credits Transfer AP\n", "0 biden M U2 4.000 33 0.0 12\n", "1 harris F U4 3.630 45 71.1 8\n", "2 blinken M U4 3.380 113 0.0 12\n", "3 emhoff M U2 3.210 28 0.0 0\n", "4 yellen F U2 3.810 33 0.0 4\n", "5 austin M U3 3.034 56 28.0 0\n", "6 garland M U4 3.200 80 44.0 24\n", "7 haaland F U4 3.580 66 30.0 0\n", "8 trump M U3 1.230 32 0.0 0\n", "9 pence M U4 2.040 106 3.0 24" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv('temp.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Rows and Columns from the Dataframe using Slices\n", "\n", "Pandas gives you a truly bewildering variety of ways to manipulate dataframes, but first we will only need to think about the basics: selecting rows, columns, or both from the table.\n", "\n", "The basic idea here is that the dataframe is a two-dimensional matrix, where the rows are indexed by numbers and the columns are indexed by column headers, so that\n", "\n", "> rows are selected by using normal Python array slices, e.g., `[0:3]`\n", "\n", "and\n", "\n", "> columns are selected by using a list of column headers, e.g., `[['Userid','GPA']]`\n", "\n", "The double brackets are not a typo! The outer brackets enclose the parameter, which is a list of header names. If there is only a single header name, then you can use a single set of brackets, e.g., `['Userid']` (this returns a single column, called a Series, rather than a dataframe)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO\n", "\n", "For the following, first try to predict what will happen, and then try it to confirm your understanding:\n", "\n", "- students[0:3]\n", "- students[5:]\n", "- students[-3:]\n", "- students[:]\n", "- students[1:7:2]\n", "- students[::-1]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Userid Gender ClassYear GPA Credits Transfer AP\n", "0 biden M U2 4.00 33 0.0 12\n", "1 harris F U4 3.63 45 71.1 8\n", "2 blinken M U4 3.38 113 0.0 12" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students[0:3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the point here is that selecting rows is really the same as slicing a list.\n", "\n", "As explained above, you select columns by giving the column headers in a list (hence the double square brackets). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO\n", "\n", "For the following, first try to predict what will happen, and then try it to confirm your understanding:\n", "- students[ ['Userid', 'GPA'] ]\n", "- students[ ['GPA', 'Gender'] ]\n", "- students[ ['Credits'] ]\n", "- students['Credits']\n", "- students[ ['GPA', 'Userid', 'GPA'] ]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " GPA Userid\n", "0 4.000 biden\n", "1 3.630 harris\n", "2 3.380 blinken\n", "3 3.210 emhoff\n", "4 3.810 yellen\n", "5 3.034 austin\n", "6 3.200 garland\n", "7 3.580 haaland\n", "8 1.230 trump\n", "9 2.040 pence" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students[['GPA','Userid']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You can also combine these two, to get a slice of rows and a selection of columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO\n", "\n", "For the following, first try to predict what will happen, and then try it to confirm your understanding:\n", "- students[0:3][['GPA','Userid']]\n", "- students[::-1][['AP','Gender','Credits']]\n", "- students['GPA'][2:7]\n", "- students[2:7][['GPA']]\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " GPA Userid\n", "0 4.00 biden\n", "1 3.63 harris\n", "2 3.38 blinken" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students[0:3][['GPA','Userid']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Note carefully the last two examples: the same rows and values are selected whichever order you put the selectors in (although `students['GPA'][2:7]` returns a Series, while `students[2:7][['GPA']]` returns a one-column dataframe).\n", "\n", "Finally, you would expect that you could do a slice on column names, like this:\n", "\n", " students[ ['GPA' : 'AP'] ] # error!\n", "\n", "Nope! In order to do this, you need to use the `loc` indexer, and give it two slices separated by a comma. Note that `loc` slices are by label and include both endpoints, so `0:3` selects rows 0 through 3.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO\n", "\n", "For the following, first try to predict what will happen, and then try it to confirm your understanding:\n", "\n", "- students.loc[ 0:3 , 'GPA':'AP' ]\n", "- students.loc[:, 'Userid':'GPA']\n", "- students.loc[:5, 'ClassYear': ]\n", "- students.loc[2:7, 'Userid':'Transfer':2]\n", "- students.loc[::-1, ::-1]\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "a = students.loc[ 0:3 , 'GPA':'AP' ]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " GPA Credits Transfer AP\n", "0 4.00 33 0.0 12\n", "1 3.63 45 71.1 8\n", "2 3.38 113 0.0 12\n", "3 3.21 28 0.0 0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " AP Transfer Credits GPA ClassYear Gender Userid\n", "9 24 3.0 106 2.040 U4 M pence\n", "8 0 0.0 32 1.230 U3 M trump\n", "7 0 30.0 66 3.580 U4 F haaland\n", "6 24 44.0 80 3.200 U4 M garland\n", "5 0 28.0 56 3.034 U3 M austin\n", "4 4 0.0 33 3.810 U2 F yellen\n", "3 0 0.0 28 3.210 U2 M emhoff\n", "2 12 0.0 113 3.380 U4 M blinken\n", "1 8 71.1 45 3.630 U4 F harris\n", "0 12 0.0 33 4.000 U2 M biden" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students.loc[::-1, ::-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Rows and Columns from the Dataframe using Boolean Expressions\n", "\n", "Pandas gives you lots of ways of selecting data, and a particularly useful way of selecting rows is to specify a boolean expression that the row values must satisfy. For example, to select only those rows representing men, we could do this:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Userid Gender ClassYear GPA Credits Transfer AP\n", "0 biden M U2 4.000 33 0.0 12\n", "2 blinken M U4 3.380 113 0.0 12\n", "3 emhoff M U2 3.210 28 0.0 0\n", "5 austin M U3 3.034 56 28.0 0\n", "6 garland M U4 3.200 80 44.0 24\n", "8 trump M U3 1.230 32 0.0 0\n", "9 pence M U4 2.040 106 3.0 24" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "men = students[ students['Gender'] == 'M' ]\n", "men" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and here we select the rows where the Credits are less than the Transfer:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Userid Gender ClassYear GPA Credits Transfer AP\n", "1 harris F U4 3.63 45 71.1 8" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lessCredits = students[students['Credits'] < students['Transfer']]\n", "lessCredits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO\n", "\n", "For the following, first try to predict what will happen, and then try it to confirm your understanding:\n", "- students[ students['Userid'] >= 'haaland' ]\n", "- students[ students['AP'] == 0 ]\n", "- students[ students['GPA'] < 3.5 ]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Userid Gender ClassYear GPA Credits Transfer AP\n", "1 harris F U4 3.63 45 71.1 8\n", "4 yellen F U2 3.81 33 0.0 4\n", "7 haaland F U4 3.58 66 30.0 0\n", "8 trump M U3 1.23 32 0.0 0\n", "9 pence M U4 2.04 106 3.0 24" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students[ students['Userid'] >= 'haaland' ]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do compound boolean expressions when selecting rows, you have to enclose the expressions in parentheses and use the \"bitwise\" boolean operations `~`, `&`, `|` (instead of the normal Python `not`, `and`, `or`); here is an example:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Userid Gender ClassYear GPA Credits Transfer AP\n", "2 blinken M U4 3.380 113 0.0 12\n", "3 emhoff M U2 3.210 28 0.0 0\n", "5 austin M U3 3.034 56 28.0 0\n", "6 garland M U4 3.200 80 44.0 24\n", "8 trump M U3 1.230 32 0.0 0\n", "9 pence M U4 2.040 106 3.0 24" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "menGPA = students[(students['Gender'] == 'M') & (students['GPA'] < 3.5 )]\n", "menGPA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Turning a column into a list\n", "\n", "Sometimes you simply want to grab one particular column (say the GPA) into a list, so that you\n", "can manipulate it in Python. This is easy, as you just have to convert a single-column frame into a list:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[4.0, 3.63, 3.38, 3.21, 3.81, 3.034, 3.2, 3.58, 1.23, 2.04]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "G = list(students['GPA'])\n", "G" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting a whole data frame into a list of lists\n", "\n", "You can convert the entire data set into Python lists as follows. 
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['biden', 'M', 'U2', 4.0, 33, 0.0, 12],\n", " ['harris', 'F', 'U4', 3.63, 45, 71.1, 8],\n", " ['blinken', 'M', 'U4', 3.38, 113, 0.0, 12],\n", " ['emhoff', 'M', 'U2', 3.21, 28, 0.0, 0],\n", " ['yellen', 'F', 'U2', 3.81, 33, 0.0, 4],\n", " ['austin', 'M', 'U3', 3.034, 56, 28.0, 0],\n", " ['garland', 'M', 'U4', 3.2, 80, 44.0, 24],\n", " ['haaland', 'F', 'U4', 3.58, 66, 30.0, 0],\n", " ['trump', 'M', 'U3', 1.23, 32, 0.0, 0],\n", " ['pence', 'M', 'U4', 2.04, 106, 3.0, 24]]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students.values.tolist()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "You can slice this to get only particular rows, or a single row:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['harris', 'F', 'U4', 3.63, 45, 71.1, 8]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(students[1:2].values.tolist())[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving your work to a file\n", "\n", "Note that each of these expressions we have explored returns a new dataframe, so that if you wanted to create a new data set derived from an existing set, you could assign an expression to a variable and write it out, e.g.," ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "studrev = students.loc[::-1, ::-1]\n", "studrev.to_csv(\"studentrev.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or just combine without bothering with the variable:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "students.loc[::-1, ::-1].to_csv(\"studentrev.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both of these 
would put the file studentrev.csv into my work directory, which I could then import into Excel." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Functions on DataFrames\n", "\n", "Having read in a dataframe, or created a new one using one of the expressions just shown, we can use a variety of functions to explore and organize the data. Most of these are very intuitive, so we will mostly try a bunch of examples; you can explore the rest in the pandas documentation under Basic Functionality: http://pandas.pydata.org/pandas-docs/stable/basics.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO\n", "\n", "For the following, which sort the dataframe, first try to predict what will happen, and then try it to confirm your understanding:\n", "\n", "- `students.sort_values('Userid')`\n", "- `students.sort_values('GPA', ascending=False)`\n", "- `students.sort_values( ['Gender', 'Userid'] )`\n", "- `students.sort_values( ['Gender', 'Userid'], ascending=[False,True] )`" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Userid Gender ClassYear GPA Credits Transfer AP\n", "0 biden M U2 4.000 33 0.0 12\n", "4 yellen F U2 3.810 33 0.0 4\n", "1 harris F U4 3.630 45 71.1 8\n", "7 haaland F U4 3.580 66 30.0 0\n", "2 blinken M U4 3.380 113 0.0 12\n", "3 emhoff M U2 3.210 28 0.0 0\n", "6 garland M U4 3.200 80 44.0 24\n", "5 austin M U3 3.034 56 28.0 0\n", "9 pence M U4 2.040 106 3.0 24\n", "8 trump M U3 1.230 32 0.0 0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students.sort_values('GPA',ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are also many statistical functions which operate mostly on individual columns:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO\n", "\n", "Let us assume that we have created a new dataframe\n", "\n", "> `st = students['GPA']`\n", "\n", "First try to predict what will happen, and then try it to confirm your understanding.\n", "\n", "- `st.max()`\n", "- `st.min()`\n", "- `st.mean()`\n", "- `st.median()`\n", "- `students['ClassYear'].mode() # the mode is the most frequent value in the list`\n", "- `students['Credits'].sum()`\n", "- `st.count() * 2 + st.max() # weird, just to show that values can be used any way you want!`\n", "- `students[ ['GPA','Credits']].max() # ok, another weird one, it returns the max in each column!`" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.23" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students['GPA'].min()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] 
}, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Displaying Distributions from DataFrames\n", "\n", "A final function that we should look at is the `hist(...)` function, which, as you might expect, produces a graphical display of the histogram for a column of values. It is essentially the same function that we studied in the first lab, except that it takes as its values the data in the column(s) of the dataframe. 
If you give it a whole dataframe, it will give you histograms of all the numeric columns:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": false }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAEICAYAAAB25L6yAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAaD0lEQVR4nO3df5RfdX3n8eeLgD/IAAFDRwyUweoiHFjRzAo0rk2Q3Q0/DtBT6oHyI1jZ9JwFNq64inXbg11pqUeoyFJ3WUHoEg0YsFDYtrKakcUikgAaIVqBTSUYSBEITEQx8No/7h0dJvPjO9/M93s/M/N6nHPPzPfO/X7v+977+b7mfj/3x1e2iYiIcu3SdAERETG+BHVEROES1BERhUtQR0QULkEdEVG4BHVEROES1BHRCEkbJR1b//6Hkj7fdE2lSlB3gaQBSc9Keu2wcddJeknSoKRnJN0p6W1N1hkxkqTfk7S2bqebJf2tpHdP9Xxs/6ntc+t59kmypF2nej7TVYK6wyT1Af8aMHDSiD9/ynYPsD+wBbium7VFjEfSh4DPAH8K9AK/DvwlcPIo0yZUOyhB3XlnA9+iCuFlo01g+6fAF4HDuldWxNgk7QX8CXCe7Vtsb7P9C9t/Y/s/S7pY0mpJN0h6HjhH0i6SLpL0qKSfSLpJ0j7DXvMsSf9U/+3jI+Z3saQb6od31T+fq/fkj5b0FknfkLRV0tOSbuzKiihEgrrzzgZW1sO/k9Q7cgJJPcAZwANdri1iLEcDrwO+Ms40JwOrgXlU7fs/AqcAvwW8CXgWuApA0qHA54Cz6r+9geqT5GjeU/+cZ7vH9j3AfwW+CuxdP+/KdhdsOkpQd1Ddl3cgcJPtdcCjwO8Nm+TDkp4DHgF6gHO6XmTE6N4APG17+zjT3GP7r22/YvtF4A+Aj9veZPvnwMXAqXW3yKnA7bbvqv/2R8Ark6jnF1TvpTfZ/pntu9tZqOkqQd1Zy4Cv2n66fvxFXt398Wnb82y/0fZJth/tfokRo/oJMH+CvufHRzw+EPiKpOfqHZANwMtU/dtvGj697W31PFr1EUDAtyU9JOn3J/HcaS8HADpE0uuB9wFzJD1Zj34tME/S25urLKIl9wA/o+rKWD3GNCNvvfk48Pu2vzlyQkmbgUOGPd6daq+9ldfF9pPAv6+f+27g/0i6y/YjEyzHjJA96s45hWpv4lDgiHo4BPi/VP3WEcWyvRX4Y+AqSadI2l3SbpKOk/SpMZ7234FLJB0IIGlfSUNniKwGTpT0bkmvoTpQOVb+/DNVt8ibh0ZI+l1JQ33az1KF+cs7s4zTSYK6c5YBX7D9I9tPDg3Af6M6cJhPM1E025cDHwL+C1V4Pg6cD/z1GE+5ArgN+KqkF6jOdjqyfq2HgPOouv82U4XtpjHm+1PgEuCbdTfKUcC/Au6VNFjPY4Xt/zcVyzkdKF8cEBFRtuxRR0QULkEdEVG4BHVEROES1BERhevImQfz5893X1/fDuO3bdvG3LlzOzHLxmXZpta6deuetr1vV2e6E4bafMntILW1p1u1jdvmbU/5sHDhQo9mzZo1o46fCbJsUwtY6w60zU4NQ22+5HaQ2trTrdrGa/Mtd31ImiPpAUm3T9E/kIiipc1HKSbTR72C6tr9iNkibT6K0FJQ15dungDkq3JiVkibj5K0ejDxM1R3r9pjrAkkLQeWA/T29jIwMLDDNFue2cqVK2+dVIGHL9hrUtM3ZXBwcNRlnglm8rKNo602P3JdrX9ia1sz
70S7L3k7prbxTRjUkk4EttheJ2nxWNPZvhq4GqC/v9+LF+846ZUrb+Wy9ZM70WTjGWPOsigDAwOMtswzwUxettHsTJsfua7OueiOtmroRLsveTumtvG10vWxCDhJ0kZgFXDMsK/MiZiJ0uajKBMGte2P2d7fdh9wGvB122d2vLKIhqTNR2lyZWJEROEm1WFsewAY6EglEQVKm48SZI86IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogo3IRBLekASWskbZD0kKQV3Sgsoilp81GaXVuYZjtwoe37Je0BrJN0p+2HO1xbRFPS5qMoE+5R295s+/769xeADcCCThcW0ZS0+SiNbLc+sdQH3AUcZvv5EX9bDiwH6O3tXbhq1aodnr/lma089eLkCjx8wV6Te0JDBgcH6enpaWna9U9sbWseTa2LVpZtqpdpyZIl62z3t/WiU2iybX7kuippW0+mjXbbdK2tne3bTptvOagl9QDfAC6xfct40/b393vt2rU7jL9y5a1ctr6V3pZf2XjpCZOavikDAwMsXry4pWn7LrqjrXk0tS5aWbapXiZJjQd1O21+5LoqaVtPpo1223StrZ3t206bb+msD0m7ATcDKydqsBEzQdp8lKSVsz4EXANssH1550uKaFbafJSmlT3qRcBZwDGSHqyH4ztcV0ST0uajKBN2GNu+G1AXaokoQtp8lCZXJkZEFC5BHRFRuAR1REThEtQREYVLUEdEFC5BHRFRuAR1REThEtQREYVLUEdEFC5BHRFRuAR1REThEtQREYVLUEdEFC5BHRFRuAR1REThEtQREYWb3DfNNqCbXw46lV9UGdFtE7XfCw/fzjltvp+mQt6T7csedURE4RLUERGFS1BHRBQuQR0RUbgEdURE4RLUERGFS1BHRBQuQR0RUbgEdURE4RLUERGFS1BHRBQuQR0RUbgEdURE4RLUERGFS1BHRBQuQR0RUbgEdURE4RLUERGFS1BHRBSupaCWtFTSDyQ9IumiThcV0bS0+SjJhEEtaQ5wFXAccChwuqRDO11YRFPS5qM0rexRvwt4xPZjtl8CVgEnd7asiEalzUdRZHv8CaRTgaW2z60fnwUcafv8EdMtB5bXDw8GfjDKy80Hnt7ZoguVZZtaB9ret8vzBHa6zZfcDlJbe7pV25htftcWnqxRxu2Q7ravBq4e94Wktbb7W5jntJNlm1HabvMlr6vU1p4Samul62MTcMCwx/sDP+5MORFFSJuPorQS1PcBb5V0kKTXAKcBt3W2rIhGpc1HUSbs+rC9XdL5wN8Dc4BrbT/U5vzG7RqZ5rJsM8ROtvmS11Vqa0/jtU14MDEiIpqVKxMjIgqXoI6IKFzXgnomXZIr6VpJWyR9b9i4fSTdKemH9c+9m6yxXZIOkLRG0gZJD0laUY+fEcs3VcZZTxdLekLSg/VwfEP1bZS0vq5hbT2u8W0o6eBh6+ZBSc9L+mBT620y72VVPltn2HclvbMbNUKX+qjrS3L/Efg3VKc+3Qecbvvhjs+8AyS9BxgE/sr2YfW4TwHP2L60/ke0t+2PNllnOyTtB+xn+35JewDrgFOAc5gByzdVxllP7wMGbX+64fo2Av22nx42rqg2WufCE8CRwPtpYL1N5r1c//O4ADi+rvkK20d2o85u7VHPqEtybd8FPDNi9MnA9fXv11O9aacd25tt31///gKwAVjADFm+qTLOeipZadvwvcCjtv+pqQIm+V4+mSrQbftbwLz6H3bHdSuoFwCPD3u8ifIb9WT12t4M1ZsY+LWG
69lpkvqAdwD3MgOXb6qMWE8A59cfja9tsIvIwFclrasvdYfytuFpwJeGPS5hvcHY66mxHOtWULd0SW6UQ1IPcDPwQdvPN11PqUZZT58DfgM4AtgMXNZQaYtsv5PqDoDn1R/xi1FfSHQS8OV6VCnrbTyN5Vi3gno2XJL71NDHoPrnlobraZuk3ajCZ6XtW+rRM2b5pspo68n2U7Zftv0K8D+puv26zvaP659bgK/UdZS0DY8D7rf9FPxqvQGLgDOAZZJObKi2sdZTYznWraCeDZfk3kbVuE6j+gh8UH00+V5J/6E+YnydpJckDUp6pj6i/LbhLyLpHEmW9L4mFkKSgGuADbYvH/an24Bl9e/LgFu7XVtJxlpPI/osfxv43sjndqG2ufUBTiTNBf5tXceo27Buj0PDK5JeHPb4jA6VeTrDuj2GrbdPAmuBL9u+vUPznshYbf024Oz6vXwUsHWoi6TjbHdloDpS+o/Ao8DHuzXfDi3Ll6g+nv2C6r/sB4A3AD8EtgPfBX6d6qPSO4CVwGuB64BP1q+xez3+WyNeew3wE+COhpbt3VQf574LPFgPx9fL97V6Gb8G7NP0dmi4DYy1nv4XsL4efxvVmSHdru3NwHfq4aGh91sr2xDYCBw7wevvupP17V638b2GjRtaby8B32xnvbVT1zjv5R3WU/1+vqrOsPVUZ9V0Z5t2uxHN1AHYC9gG/M440/wyqOvHJ1CdkjT0+EDgFeB36sDvbXq5MsyuYbSgptrLvbEOtReoTtU8GvgW8FwddJ8Fdqun37X+J/YHwCPAs8Bnh73evwDuArZS3ef5i8Pm/QrwItUpc3OAecAX6nlsAv4E2KWe/tz6dT5LdebGxU2vv04NuTJx6hxNtdfcUpdAfRDqDOCBYaPPBtbavpnqdK9OfeyMmKzfBr5ItUNyI9WOxAqqm+ovApZSBfNwxwMLqT5Vninp2Hr8JcAdwN5U/bxXAdjuo+rzPc52j6s+6xuogvs3gH6qnZv3D5vHb1K9V/YF/nzKlrYwCeqpMx942vb2oRGS/kHSc3Wf39BR9w9Leo5qT6OHau9kyNlUbwbqn8uIKMPdtv/G9iu2X7R9n+17bW+3/RjVHeZ+a8Rz/sz2VtsbgQGqMzqg6mboo+re+Jntb442Q0kLqM61/k+2f2r7SeAzVMe4hvzI9udcHcB9caoWtjQJ6qnzE2C+pF/eOtb2b9qeV/9taF1/2vY822+0fZLtRwEkLQIOoroYCKqgPlzSEUQ0b/j5w0h6m6Q7JD0p6XmqLon5I57z5LDff0q1YwJwIbAbsFbVZe5j7ZAcSPUp9al6h+c5qr3v3rHqmqkS1FPnHuDntH/F5TKqgxUPSnqSX108cfYU1Baxs0aeL/w/qM4keYvtPYE/ZvTzjHd8oeqqznNt7wecB1wt6aBRJn2cKuD3qXdu5tne0/a/HKeuGSlBPUVsPwd8AvhLSadK6pG0S71HPHe850p6HdU9IpZTfTwcGi4Azhi+lx5RiD2oDgZuk3QIO/ZPj0nS++puDagORhp4eeR0th8HvgF8WtKe9fvpLaVdvNMNCeopZPtTwIeAj1CdJP8U1Z7HR4F/GOepp1AdMPkr208ODVTn6c6hOlATUZILqT4FvkDVxm+cxHOPBO6TtA24BTjP9o/GmPZMqh2dh6nOHvky8MZ2i56u8g0vERGFyx51REThEtQREYVLUEdEFC5BHRFRuI6c9jV//nz39fV14qWn3LZt25g7d9yz54o2U+tft27d07b3baCktozV5kvdPiXWNdtrGrfNd+IGIgsXLvR0sWbNmqZL2CkztX6qe540fjOcVoex2nyp26fEumZ7TeO1+Qm7PiS9TtK3JX1H1bctf2KK/5FEFEfSPEmrJX1f1TeNH910TTF7tdL18XPgGNuD9Tda3C3pb119uWPETHUF8He2T62/7GL3pguK2WvCoK53yQfrh7vVQ66SiRlL0p7Ae6jvbGj7Jaob2kc0oqUrEyXN
AdYBbwGusv3RUaZZTnWvCnp7exeuWrVq5CRFGhwcpKenZ+IJCzVR/euf2Drp1zx8wV47U9KkjFX/kiVL1tnu71ohw9T3Z7ma6rLlt1O1/RW2t42YbsI2v+WZrTzVxs03O70NSmz3s72m8dr8pC4hlzSP6osyL7A95nfB9ff3e+3atZMutAkDAwMsXry46TLaNlH9fRfdMenX3HjpCTtR0eSMVb+kJoO6n+rbSxbZvlfSFcDztv9orOeM1eavXHkrl62f/MlVnd4GJbb72V7TeG1+UudRu7pD3AC5SVDMbJuATbaHbjW7Gnhng/XELNfKWR/71nvSSHo9cCzw/U4XFtEUV3cufFzSwfWo91J1g0Q0opXPZPsB19f91LsAN7m5r3GP6JYLgJX1GR+P8erv6YvoqlbO+vgu1ZdTRswath+k+jLViMblXh8REYVLUEdEFC5BHRFRuAR1REThEtQREYVLUEdEFC5BHRFRuAR1REThEtQREYVLUEdEFC5BHRFRuAR1REThEtQREYVLUEdEFC5BHRFRuAR1REThEtQREYVLUEdEFC5BHRFRuAR1REThEtQREYWbMKglHSBpjaQNkh6StKIbhUU0TdIcSQ9Iur3pWmJ227WFabYDF9q+X9IewDpJd9p+uMO1RTRtBbAB2LPpQmJ2m3CP2vZm2/fXv79A1XAXdLqwiCZJ2h84Afh807VEyHbrE0t9wF3AYbafH/G35cBygN7e3oWrVq2auio7aHBwkJ6enqbLaNtE9a9/YmsXq5m8g/aaM2r9S5YsWWe7v4GSAJC0GvgzYA/gw7ZPHGWaCdv8lme28tSLk5//4Qv2mvyTJqHEdj/baxqvzbfS9QGApB7gZuCDI0MawPbVwNUA/f39Xrx4cXvVdtnAwADTpdbRTFT/ORfd0b1i2nDd0rnFrX9JJwJbbK+TtHis6Vpp81euvJXL1rf8NvuljWeMOdspUWK7T01ja+msD0m7UYX0Stu3dLakiMYtAk6StBFYBRwj6YZmS4rZrJWzPgRcA2ywfXnnS4polu2P2d7fdh9wGvB122c2XFbMYq3sUS8CzqLaq3iwHo7vcF0REVGbsPPM9t2AulBLRHFsDwADDZcRs1yuTIyIKFyCOiKicAnqiIjCJagjIgqXoI6IKFyCOiKicAnqiIjCJagjIgqXoI6IKFyCOiKicAnqiIjCJagjIgqXoI6IKFyCOiKicAnqiIjCJagjIgqXoI6IKFyCOiKicAnqiIjCJagjIgqXoI6IKNyEQS3pWklbJH2vGwVFNE3SAZLWSNog6SFJK5quKWa3VvaorwOWdriOiJJsBy60fQhwFHCepEMbrilmsQmD2vZdwDNdqCWiCLY3276//v0FYAOwoNmqYjaT7YknkvqA220fNs40y4HlAL29vQtXrVq1wzTrn9g66QIPX7DXpJ8zmXn1vh6eenHn5tUtoy3T8Pqno4P2mkNPT88O45csWbLOdn8DJb1K3fbvAg6z/fyIv03Y5rc8s7Wt7dPptjg4ODjqem/SbK9pvDY/ZUE9XH9/v9euXbvD+L6L7mjl6a+y8dITJv2cyczrwsO3c9n6XXdqXt0y2jINr386um7pXBYvXrzDeEmNB7WkHuAbwCW2bxlv2rHa/JUrb21r+3S6LQ4MDIy63ps022sar83nrI+IUUjaDbgZWDlRSEd0WoI6YgRJAq4BNti+vOl6Ilo5Pe9LwD3AwZI2SfpA58uKaNQi4CzgGEkP1sPxTRcVs9eEnWe2T+9GIRGlsH03oKbriBiSro+IiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMJN39uuRcROa+eOlu1q546A3bzjZju6VV/2qCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwCeqIiMIlqCMiCpegjogoXII6IqJwLQW1pKWSfiDpEUkXdbqoiKalzUdJJgxqSXOAq4DjgEOB0yUd
2unCIpqSNh+laWWP+l3AI7Yfs/0SsAo4ubNlRTQqbT6KItvjTyCdCiy1fW79+CzgSNvnj5huObC8fngw8IOpL7cj5gNPN13ETpip9R9oe99uFwNT3uZL3T4l1jXbaxqzzbdyP2qNMm6HdLd9NXD1JAtrnKS1tvubrqNdqb8jpqzNF7p8RdaVmsbWStfHJuCAYY/3B37cmXIiipA2H0VpJajvA94q6SBJrwFOA27rbFkRjUqbj6JM2PVhe7uk84G/B+YA19p+qOOVdc+0664ZIfVPsSlu88UtX63EulLTGCY8mBgREc3KlYkREYVLUEdEFG7WBrWkAyStkbRB0kOSVjRdU6skvU7StyV9p679E03X1A5JcyQ9IOn2pmvphBIuQx+rnUvaR9Kdkn5Y/9y7gdpetf3rg7f31jXdWB/I7XZN8yStlvT9ep0dXcK6mrVBDWwHLrR9CHAUcN40ukz458Axtt8OHAEslXRUwzW1YwWwoekiOqGgy9DHaucXAV+z/Vbga/Xjbhu5/f8c+Iu6pmeBDzRQ0xXA39l+G/D2ur7G19WsDWrbm23fX//+AtUGWdBsVa1xZbB+uFs9TKujwpL2B04APt90LR1SxGXo47Tzk4Hr68muB07pZl0jt78kAccAqxusaU/gPcA1ALZfsv0cDa8rmMVBPZykPuAdwL3NVtK6+mPjg8AW4E7b06b22meAjwCvNF1IhywAHh/2eBMN7wiMaOe9tjdDFebAr3W5nJHb/w3Ac7a314+bWF9vBv4Z+ELdJfN5SXNpfl0lqCX1ADcDH7T9fNP1tMr2y7aPoLpq7l2SDmu6plZJOhHYYntd07V0UEuXoXdLSe18jO1fwvraFXgn8Dnb7wC20UyX0A5mdVBL2o2q8a60fUvT9bSj/mg2ACxtuJTJWAScJGkjVZfAMZJuaLakKVfMZehjtPOnJO1X/30/qk9m3bLD9qfaw54naegivCbW1yZg07BPp6upgrvJdQXM4qCu+8SuATbYvrzpeiZD0r6S5tW/vx44Fvh+s1W1zvbHbO9vu4/q8uyv2z6z4bKmWhGXoY/Tzm8DltW/LwNu7VZNY2z/M4A1wKlN1FTX9STwuKSD61HvBR6mwXU1pJW7581Ui4CzgPV1Xy/AH9r+3w3W1Kr9gOvrMwt2AW6yPSNPcZuuCrr1wqjtHLgUuEnSB4AfAb/bQG0jfRRYJemTwAPUB/W67AJgZf3P9THg/dTvsSbXVS4hj4go3Kzt+oiImC4S1BERhUtQR0QULkEdEVG4BHVEROES1BERhUtQR0QU7v8DNaBsZLmKMQQAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "students.hist()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is usually more useful to consider a single column at a time:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAARL0lEQVR4nO3dX4xcZ3nH8e+DbcDygiPqsESO6UbCqkTj8serJFVuZgOVnBDFF6TUiAaMiFZFRITKqBgugkCqCheBQoOITBM5oSgbRGhlHCOUAtuQiwC7ronjGCqDUuEkSkgCDgtu0MLTiz3Q1WR25+zsmR3P2+9HGnnOvO/MPI/f8W/PnD0zjsxEkjT8XjToAiRJzTDQJakQBrokFcJAl6RCGOiSVIj1g3riLVu25NjY2KCevie/+tWv2LRp06DL6At7G072NpxW09vs7OzTmXl+p7GBBfrY2BgzMzODevqeTE9P02q1Bl1GX9jbcLK34bSa3iLiv5ca85CLJBXCQJekQhjoklQIA12SCmGgS1IhDHRJKkTXQI+Il0bE9yLiBxFxIiI+1mHOSyLi7og4FRHfjYixfhQrSVpanT3054ErMvN1wOuBXRFxWduc9wA/z8zXAJ8GPtlsmZKkbroGei6YqzY3VJf2L1HfDdxRXf8K8KaIiMaqlCR1FXX+g4uIWAfMAq8BPpeZH2obfxjYlZmnq+0fA5dm5tNt8yaBSYDR0dGdU1NTjTSxVubm5hgZGRl0GX1hb8Opqd6OP3amgWp6s2Pr5o63u26dTUxMzGbmeKexWh/9z8zfAq+PiPOAf42IizPz4UVTOu2Nv+AnRWYeAA4AjI+P57B9rNePIg8ne+tu7/57V19Mjx59R6vj7a7byq3oLJfM/AUwDexqGzoNbAOIiPXAZuDZBuqTJNVU5yyX86s9cyJiI/Bm4Idt0w4B76quXwt8K/3PSiVpTdU55HIBcEd1HP1FwJcz83BEfByYycxDwG3AFyPiFAt75nv6VrEkqaOugZ6ZDwFv6HD7TYuu/w/wl82WJklaCT8pKkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RCGOiSVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmF6BroEbEtIr4dEScj4kRE3NhhTisizkTEsepyU3/KlSQtZX2NOfPAvsw8GhEvA2Yj4r7MfKRt3ncy8+rmS5Qk1dF1Dz0zn8jMo9X1XwInga39LkyStDKRmfUnR4wB9wMXZ+Zzi25vAfcAp4HHgQ9m5okO958EJgFGR0d3Tk1NraL0tTc3N8fIyMigy+gLextOTfV2/LEzDVTTmx1bN3e83XXrbGJiYjYzxzuN1Q70iBgB/gP4+8z8atvYy4HfZeZcRFwFfCYzty/3eOPj4zkzM1Pruc8V09PTtFqtQZfRF/Y2nJrqbWz/
vasvpkePfuItHW933TqLiCUDvdZZLhGxgYU98C+1hzlAZj6XmXPV9SPAhojY0lO1kqSe1DnLJYDbgJOZ+akl5ryqmkdEXFI97jNNFipJWl6ds1wuB64DjkfEseq2jwCvBsjMW4FrgfdGxDxwFtiTKzk4L0lata6BnpkPANFlzi3ALU0VJUlaOT8pKkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RCGOiSVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRBdAz0itkXEtyPiZESciIgbO8yJiPhsRJyKiIci4o39KVeStJT1NebMA/sy82hEvAyYjYj7MvORRXOuBLZXl0uBz1d/SpLWSNc99Mx8IjOPVtd/CZwEtrZN2w3cmQseBM6LiAsar1aStKTIzPqTI8aA+4GLM/O5RbcfBj6RmQ9U298EPpSZM233nwQmAUZHR3dOTU2ttv41NTc3x8jIyKDL6At7G05N9Xb8sTMNVNOs0Y3w5NlBV9EfF21e1/O6TUxMzGbmeKexOodcAIiIEeAe4AOLw/z3wx3u8oKfFJl5ADgAMD4+nq1Wq+7TnxOmp6cZtprrsrfh1FRve/ffu/piGrZvxzw3H68dUUPl4K5NfXlN1jrLJSI2sBDmX8rMr3aYchrYtmj7QuDx1ZcnSaqrzlkuAdwGnMzMTy0x7RDwzupsl8uAM5n5RIN1SpK6qPN+5nLgOuB4RByrbvsI8GqAzLwVOAJcBZwCfg28u/lSJUnL6Rro1S86Ox0jXzwngfc1VZQkaeX8pKgkFcJAl6RCGOiSVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFaJroEfE7RHxVEQ8vMR4KyLORMSx6nJT82VKkrpZX2POQeAW4M5l5nwnM69upCJJUk+67qFn5v3As2tQiyRpFSIzu0+KGAMOZ+bFHcZawD3AaeBx4IOZeWKJx5kEJgFGR0d3Tk1N9Vr3QMzNzTEyMjLoMvrC3oZTU70df+xMA9U0a3QjPHl20FX0x0Wb1/W8bhMTE7OZOd5prIlAfznwu8yci4irgM9k5vZujzk+Pp4zMzNdn/tcMj09TavVGnQZfWFvw6mp3sb237v6Yhq2b8c8Nx+vc1R4+BzctanndYuIJQN91We5ZOZzmTlXXT8CbIiILat9XEnSyqw60CPiVRER1fVLqsd8ZrWPK0lama7vZyLiLqAFbImI08BHgQ0AmXkrcC3w3oiYB84Ce7LOcRxJUqO6Bnpmvr3L+C0snNYoSRogPykqSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RCGOiSVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEF0DPSJuj4inIuLhJcYjIj4bEaci4qGIeGPzZUqSuqmzh34Q2LXM+JXA9uoyCXx+9WVJklaqa6Bn5v3As8tM2Q3cmQseBM6LiAuaKlCSVE9kZvdJEWPA4cy8uMPYYeATmflAtf1N4EOZOdNh7iQLe/GMjo7unJqa6qno44+d6el+qzW6EV75is0Dee5+9zy6EZ4823lsx9bB9NyUubk5RkZGBl1GLStd5+XWbdiV3NtFm9f1/JqcmJiYzczxTmPrV1XVguhwW8efEpl5ADgAMD4+nq1Wq6cn3Lv/3p7ut1r7dszzth5rXq1+97xvxzw3H+/8cnj0Ha2+Pne/TU9P0+trba2tdJ2XW7dhV3JvB3dt6strsomzXE4D2xZtXwg83sDjSpJW
oIlAPwS8szrb5TLgTGY+0cDjSpJWoOv7mYi4C2gBWyLiNPBRYANAZt4KHAGuAk4Bvwbe3a9iJUlL6xromfn2LuMJvK+xiiRJPfGTopJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RCGOiSVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUiFqBHhG7IuJHEXEqIvZ3GN8bET+LiGPV5frmS5UkLWd9twkRsQ74HPAXwGng+xFxKDMfaZt6d2be0IcaJUk11NlDvwQ4lZk/yczfAFPA7v6WJUlaqcjM5SdEXAvsyszrq+3rgEsX741HxF7gH4CfAf8F/G1m/rTDY00CkwCjo6M7p6ameir6+GNnerrfao1uhFe+YvNAnrvfPY9uhCfPdh7bsXUwPTdlbm6OkZGRQZdRy0rXebl1G3Yl93bR5nU9vyYnJiZmM3O801jXQy5AdLit/afA14C7MvP5iPgb4A7gihfcKfMAcABgfHw8W61Wjad/ob377+3pfqu1b8c8b+ux5tXqd8/7dsxz8/HOL4dH39Hq63P32/T0NL2+1tbaStd5uXUbdiX3dnDXpr68JusccjkNbFu0fSHw+OIJmflMZj5fbX4B2NlMeZKkuuoE+veB7RFxUUS8GNgDHFo8ISIuWLR5DXCyuRIlSXV0fT+TmfMRcQPwDWAdcHtmnoiIjwMzmXkIeH9EXAPMA88Ce/tYsySpg1oHqDLzCHCk7babFl3/MPDhZkuTJK2EnxSVpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RCGOiSVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUiFqBHhG7IuJHEXEqIvZ3GH9JRNxdjX83IsaaLlSStLyugR4R64DPAVcCrwXeHhGvbZv2HuDnmfka4NPAJ5suVJK0vDp76JcApzLzJ5n5G2AK2N02ZzdwR3X9K8CbIiKaK1OS1E1k5vITIq4FdmXm9dX2dcClmXnDojkPV3NOV9s/ruY83fZYk8BktfknwI+aamSNbAGe7jprONnbcLK34bSa3v44M8/vNLC+xp077Wm3/xSoM4fMPAAcqPGc56SImMnM8UHX0Q/2NpzsbTj1q7c6h1xOA9sWbV8IPL7UnIhYD2wGnm2iQElSPXUC/fvA9oi4KCJeDOwBDrXNOQS8q7p+LfCt7HYsR5LUqK6HXDJzPiJuAL4BrANuz8wTEfFxYCYzDwG3AV+MiFMs7Jnv6WfRAzS0h4tqsLfhZG/DqS+9df2lqCRpOPhJUUkqhIEuSYUw0NtExO0R8VR1bn2n8VZEnImIY9XlprWusVcRsS0ivh0RJyPiRETc2GFORMRnq69xeCgi3jiIWleqZm9DuXYR8dKI+F5E/KDq7WMd5gzl12/U7G1vRPxs0bpdP4haexER6yLiPyPicIexxtesznno/98cBG4B7lxmzncy8+q1KadR88C+zDwaES8DZiPivsx8ZNGcK4Ht1eVS4PPVn+e6Or3BcK7d88AVmTkXERuAByLi65n54KI5f/j6jYjYw8LXb/zVIIpdoTq9Ady9+MOMQ+RG4CTw8g5jja+Ze+htMvN+Cj2HPjOfyMyj1fVfsvBC29o2bTdwZy54EDgvIi5Y41JXrGZvQ6lai7lqc0N1aT+bYSi/fqNmb0MpIi4E3gL88xJTGl8zA703f169Rfx6RPzpoIvpRfX27g3Ad9uGtgI/XbR9miELxmV6gyFdu+qt+zHgKeC+zFxy3TJzHjgD/NHaVtmbGr0BvLU6BPiViNjWYfxc9I/A3wG/W2K88TUz0FfuKAvfpfA64J+AfxtwPSsWESPAPcAHMvO59uEOdxmaPaYuvQ3t
2mXmbzPz9Sx8UvuSiLi4bcrQrluN3r4GjGXmnwH/zv/t1Z6zIuJq4KnMnF1uWofbVrVmBvoKZeZzv3+LmJlHgA0RsWXAZdVWHae8B/hSZn61w5Q6X/VwTurW27CvHUBm/gKYBna1DQ39128s1VtmPpOZz1ebXwB2rnFpvbgcuCYiHmXhG2qviIh/aZvT+JoZ6CsUEa/6/XGuiLiEhb/DZwZbVT1V3bcBJzPzU0tMOwS8szrb5TLgTGY+sWZF9qhOb8O6dhFxfkScV13fCLwZ+GHbtKH8+o06vbX9DucaFn4/ck7LzA9n5oWZOcbCJ+e/lZl/3Tat8TXzLJc2EXEX0AK2RMRp4KMs/KKGzLyVhb/490bEPHAW2DMM/3AqlwPXAcerY5YAHwFeDX/o7whwFXAK+DXw7gHU2Ys6vQ3r2l0A3BEL/9nMi4AvZ+bhKOPrN+r09v6IuIaFM5meBfYOrNpV6vea+dF/SSqEh1wkqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSrE/wLzH8ifOkJJhgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "students['GPA'].hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can set some of the parameters that work with the `hist(...)` function, such as setting the bin boundaries, colors, and so on:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAARC0lEQVR4nO3db4xcdb3H8feHUv/cQCSxG0toYb3RJ2oEcYMYkhuC3gSVlAdiUhP/YDRNvBIx18SoDzDyzCdqFCOpQizqVQwaUwnkBoNEfUB1WwuC1ZteI6EBywpaJCim8r0P9pC7Tmd2zrazO8PP9yuZcM6cX+d88lvms6enZ+akqpAkPf+dNu0AkqTJsNAlqREWuiQ1wkKXpEZY6JLUiNOnteMtW7bU/Pz8tHYvSc9L+/fv/0NVzQ3bNrVCn5+fZ3FxcVq7l6TnpSQPjdrmKRdJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUiN6FnmRTkl8kuX3IthcmuTXJ4ST7ksxPMqQkaby1HKFfCxwase39wB+r6hXA54DPnGowSdLa9Cr0JNuAtwFfHTHkSmBPt3wb8KYkOfV4kqS++n5S9PPAx4AzR2w/B3gYoKqOJzkGvBT4w8pBSXYBuwDOPffck8kraQbNb93KQ0ePTjvGCf7ltNN4+tlnpx3jBOe97GX87ve/n/jrjj1CT3IF8FhV7V9t2JDnTrgVUlXtrqqFqlqYmxv6VQSSnoceOnqUgpl7PP3ss1PPMOyxXr/8+pxyuQTYkeR3wLeBy5J8Y2DMEWA7QJLTgZcAT0wwpyRpjLGFXlWfqKptVTUP7ATurqp3DQzbC7y3W76qG+PNSiVpA530ty0muR5YrKq9wE3A15McZvnIfOeE8kmSelpToVfVPcA93fJ1K57/K/COSQaTJK2NnxSVpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDWiz02iX5TkZ0nuS/Jgkk8PGXN1kqUkB7vHB9YnriRplD53LHoGuKyqnkqyGfhpkjur6t6BcbdW1TWTjyhJ6mNsoXc3e36qW93cPbwBtCTNmF7n0JNsSnIQeAy4q6r2DRn29iT3J7ktyfaJppQkjdWr0Kvq71V1AbANuCjJawaG/ACYr6rXAj8E9gx7nSS7kiwmWVxaWjqV3JKkAWu6yqWq/gTcA1w+8PzjVfVMt/oV4PUj/vzuqlqoqoW5ubmTiCtJGqXPVS5zSc7qll8MvBn49cCYs1es7gAOTTKkJGm8Ple5nA3sSbKJ5V8A36mq25NcDyxW1V7gw0l2AMeBJ4Cr1yuwJGm4LF/EsvEWFhZqcXFxKvuWNFlJZvLStzCbl+QFONnuTbK/qhaGbfOTopLUCAtdkhpho
UtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktSIPvcUfVGSnyW5L8mDST49ZMwLk9ya5HCSfUnm1yOsJGm0PkfozwCXVdX5wAXA5UkuHhjzfuCPVfUK4HPAZyYbU5I0zthCr2VPdaubu8fgzfCuBPZ0y7cBb0qSiaWUJI3V6xx6kk1JDgKPAXdV1b6BIecADwNU1XHgGPDSIa+zK8liksWlpaVTSy5J+ge9Cr2q/l5VFwDbgIuSvGZgyLCj8RNuaV1Vu6tqoaoW5ubm1p5WkjTSmq5yqao/AfcAlw9sOgJsB0hyOvAS4IkJ5JMk9dTnKpe5JGd1yy8G3gz8emDYXuC93fJVwN1VdcIRuiRp/ZzeY8zZwJ4km1j+BfCdqro9yfXAYlXtBW4Cvp7kMMtH5jvXLbEkaaixhV5V9wOvG/L8dSuW/wq8Y7LRJElr4SdFJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqRF97im6PcmPkhxK8mCSa4eMuTTJsSQHu8d1w15LkrR++txT9Djw0ao6kORMYH+Su6rqVwPjflJVV0w+oiSpj7FH6FX1aFUd6Jb/DBwCzlnvYJKktVnTOfQk8yzfMHrfkM1vTHJfkjuTvHrEn9+VZDHJ4tLS0prDSpJG613oSc4Avgt8pKqeHNh8ADivqs4Hvgh8f9hrVNXuqlqoqoW5ubmTzSxJGqJXoSfZzHKZf7Oqvje4vaqerKqnuuU7gM1Jtkw0qSRpVX2ucglwE3Coqj47YszWbhxJLupe9/FJBpUkra7PVS6XAO8GfpnkYPfcJ4FzAarqRuAq4INJjgN/AXZWVa1DXknSCGMLvap+CmTMmBuAGyYVSpK0dn5SVJIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhrR556i25P8KMmhJA8muXbImCT5QpLDSe5PcuH6xJUkjdLnnqLHgY9W1YEkZwL7k9xVVb9aMeYtwCu7xxuAL3f/lSRtkLFH6FX1aFUd6Jb/DBwCzhkYdiVwSy27FzgrydkTTytJGmlN59CTzAOvA/YNbDoHeHjF+hFOLH2S7EqymGRxaWlpbUklSavqXehJzgC+C3ykqp4c3Dzkj9QJT1TtrqqFqlqYm5tbW1JJ0qp6FXqSzSyX+Ter6ntDhhwBtq9Y3wY8curxJEl99bnKJcBNwKGq+uyIYXuB93RXu1wMHKuqRyeYU5I0Rp+rXC4B3g38MsnB7rlPAucCVNWNwB3AW4HDwNPA+yYfVZK0mrGFXlU/Zfg58pVjCvjQpEJJktbOT4pKUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSI/rcU/TmJI8leWDE9kuTHEtysHtcN/mYkqRx+txT9GvADcAtq4z5SVVdMZFEkqSTMvYIvap+DDyxAVkkSadgUufQ35jkviR3Jnn1qEFJdiVZTLK4tLQ0oV1LkmAyhX4AOK+qzge+CHx/1MCq2l1VC1W1MDc3N4FdS5Kec8qFXlVPVtVT3fIdwOYkW045mSRpTU650JNsTZJu+aLuNR8/1deVJK3N2KtcknwLuBTYkuQI8ClgM0BV3QhcBXwwyXHgL8DOqqp1SyxJGmpsoVfVO8dsv4HlyxolSVPkJ0UlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpEWMLPcnNSR5L8sCI7UnyhSSHk9yf5MLJx
5QkjdPnCP1rwOWrbH8L8MrusQv48qnHkiSt1dhCr6ofA0+sMuRK4JZadi9wVpKzJxVQktTPJM6hnwM8vGL9SPfcCZLsSrKYZHFpaWkCu1Zf81u3kmTmHvNbt057aoZyvvR8dPoEXiNDnqthA6tqN7AbYGFhYegYrY+Hjh4d/kOZshw9Ou0IQzlfej6axBH6EWD7ivVtwCMTeF1J0hpMotD3Au/prna5GDhWVY9O4HUlSWsw9pRLkm8BlwJbkhwBPgVsBqiqG4E7gLcCh4GngfetV1hJ0mhjC72q3jlmewEfmlgiSdJJ8ZOiktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1IhehZ7k8iS/SXI4yceHbL86yVKSg93jA5OPKklaTZ97im4CvgT8O3AE+HmSvVX1q4Ght1bVNeuQUZLUQ58j9IuAw1X126r6G/Bt4Mr1jSVJWqs+hX4O8PCK9SPdc4PenuT+JLcl2T7shZLsSrKYZHFpaekk4kqSRulT6BnyXA2s/wCYr6rXAj8E9gx7oaraXVULVbUwNze3tqSSpFX1KfQjwMoj7m3AIysHVNXjVfVMt/oV4PWTiSdJ6qtPof8ceGWSlyd5AbAT2LtyQJKzV6zuAA5NLqIkqY+xV7lU1fEk1wD/DWwCbq6qB5NcDyxW1V7gw0l2AMeBJ4Cr1zGzJGmIVA2eDt8YCwsLtbi4OJV9/zNKcsI/fMyCANP6f3A1ztfazPR8TTvEEKfyc0yyv6oWhm3zk6KS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUiF6FnuTyJL9JcjjJx4dsf2GSW7vt+5LMTzqoJGl1Yws9ySbgS8BbgFcB70zyqoFh7wf+WFWvAD4HfGbSQSVJq+tzhH4RcLiqfltVfwO+DVw5MOZKYE+3fBvwpiSZXExJ0jin9xhzDvDwivUjwBtGjamq40mOAS8F/rByUJJdwK5u9akkvzmZ0MCWwdeeEbOaC2BLZjPbliSzmcv5WovZna9ZzXXyP8fzRm3oU+jDjrQHb1fdZwxVtRvY3WOfqwdKFkfd9XqaZjUXzG42c62Nudbmny1Xn1MuR4DtK9a3AY+MGpPkdOAlwBOTCChJ6qdPof8ceGWSlyd5AbAT2DswZi/w3m75KuDuqjrhCF2StH7GnnLpzolfA/w3sAm4uaoeTHI9sFhVe4GbgK8nOczykfnO9QzNBE7brJNZzQWzm81ca2OutfmnyhUPpCWpDX5SVJIaYaFLUiNmutBn9SsHeuS6OslSkoPd4wMblOvmJI8leWDE9iT5Qpf7/iQXzkiuS5McWzFf121Apu1JfpTkUJIHk1w7ZMyGz1fPXBs+X91+X5TkZ0nu67J9esiYDX9P9sw1rffkpiS/SHL7kG2Tn6uqmskHy/8A+7/AvwIvAO4DXjUw5j+AG7vlncCtM5LrauCGKczZvwEXAg+M2P5W4E6WPzdwMbBvRnJdCty+wXN1NnBht3wm8D9Dfo4bPl89c234fHX7DXBGt7wZ2AdcPDBmGu/JPrmm9Z78T+C/hv281mOuZvkIfVa/cqBPrqmoqh+z+vX/VwK31LJ7gbOSnD0DuTZcVT1aVQe65T8Dh1j+xPNKGz5fPXNNRTcPT3Wrm7vH4FUVG/6e7JlrwyXZBrwN+OqIIROfq1ku9GFfOTD4P/Y/fOUA8NxXDkw7F8Dbu7+m35Zk+5Dt09A3+zS8sfsr851JXr2RO+7+qvs6lo/sVprqfK2SC6Y0X90phIPAY8BdVTVyzjbwPdknF2z8e/LzwMeAZ0dsn/hczXKhT+wrByaszz5/AMxX1WuBH/L/v4WnbRrz1ccB4LyqOh/4IvD9jdpxkjOA7wIfqaonBzcP+SMbM
l9jck1tvqrq71V1AcufGL8oyWsGhkxlznrk2tD3ZJIrgMeqav9qw4Y8d0pzNcuFPqtfOTA2V1U9XlXPdKtfAV6/zpn66jOnG66qnnzur8xVdQewOcmW9d5vks0sl+Y3q+p7Q4ZMZb7G5ZrWfA1k+BNwD3D5wKapfg3IqFxTeE9eAuxI8juWT8teluQbA2MmPlezXOiz+pUDY3MNnGfdwfJ50FmwF3hPd/XGxcCxqnp02qGSbH3u3GGSi1j+//Lxdd5nWP6E86Gq+uyIYRs+X31yTWO+un3NJTmrW34x8Gbg1wPDNvw92SfXRr8nq+oTVbWtquZZ7oi7q+pdA8MmPld9vm1xKmo2v3Kgb64PJ9kBHO9yXb3euQCSfIvlKyC2JDkCfIrlfyCiqm4E7mD5yo3DwNPA+2Yk11XAB5McB/4C7NyAX8yXAO8GftmdewX4JHDuilzTmK8+uaYxX7B8Bc6eLN/05jTgO1V1+7Tfkz1zTeU9OWi958qP/ktSI2b5lIskaQ0sdElqhIUuSY2w0CWpERa6JDXCQpekRljoktSI/wOaWoJJ7iKhiAAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "students['GPA'].hist(bins=[0.0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0],color='r',edgecolor='k',grid=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use most of the standard matplotlib functions to add titles, change the size, etc.:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(10,5))\n", "plt.title(\"GPA Data\")\n", "students['GPA'].hist(bins=[0.0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0],color='g',edgecolor='k')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Two\n", "\n", "Download the following file of student data for 4287 students at a modern university: \n", "\n", "> https://www.cs.bu.edu/fac/snyder/cs237/Data/studentdata.csv \n", "\n", "and do the following:\n", "\n", "(A) Print out the mean GPA for men;\n", "\n", "(B) Print out the mean GPA for seniors (U4);\n", "\n", "(C) Display the 10 individuals with the largest number of credits earned, sorted in descending order by GPA;\n", "\n", "(D) Display a histogram of the GPA of all individuals, with bins for each letter grade, i.e., 0.0, 1.0, 2.0, 2.33, 2.67, 3.0, 3.33, 3.67, and 4.0.\n", "\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Your solution here\n", "# Make sure to run it before submitting it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Three\n", "\n", "Download the following file of heights and weights of 25,000 individuals: \n", "\n", "> https://cs-web.bu.edu/fac/snyder/cs237/Data/biometricdata.csv \n", "\n", "and do the following:\n", "\n", "(A) Print out the maximum, minimum, mean, and (unbiased) standard deviation (all functions listed above) for the heights of all individuals;\n", "\n", "(B) Print out the mean height for all individuals weighing more than 130 pounds;\n", "\n", "(C) Print out how many individuals have a height >= 65 inches and <= 70 inches;\n", "\n", "(D) Display a histogram of the heights of all individuals from the minimum to the maximum, where each bin represents 1 inch. 
Calculate the bin boundaries so that the bins are centered on the height, that is, the boundaries are\n", "halfway between each inch measurement:\n", "\n", "> [ .... 64.5, 65.5, 66.5, 67.5, ..... ]\n", "\n", "The left edge of the lowest bin would be the minimum value (rounded to inches) - 0.5, and the right edge of\n", "the highest bin would be the maximum (rounded to inches) + 0.5. The height values themselves are not rounded, just\n", "the maximum and minimum, to get the appropriate bin edges. \n", "\n", "Also, provide an appropriate title, and make the histogram larger using an appropriate figsize, as shown above. \n", "\n", "Hint: For (D), create the bin edges using the function `np.arange(...)` (Google it!). \n" ] }, { "cell_type": "markdown", "metadata": { "scrolled": false }, "source": [ "**Solution:**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Problem Four (Normal Distribution)\n", " \n", "The lifetime of backup battery systems made by\n", "Company A has a mean of\n", "5 years and a standard deviation of 2 years. Those made by Company B\n", "have a mean of 4 years and a standard deviation of 18 months. \n", "Suppose Wayne buys one backup system from Company A, and also one from\n", "Company B. The one from Company A lasts 4 years and 3 months, and the one\n", "from Company B lasts 3 years and 9 months. \n", "\n", "(A) Which of these backup battery systems performed relatively better, compared\n", "with other systems from the same company? \n", "\n", "(B) If the backup system from Company A lasts 5 years, how long would the system from Company B have to last to perform equally well relative to other systems from its own company? 
\n", "\n", "\n", "Hint: Standardize the two normal distributions and compare!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Five (Normal Approximation to the Binomial)\n", "\n", "This problem concerns the normal approximation to the binomial. The \"continuity correction\" (sometimes called Yates's Continuity Correction) was shown in lecture. \n", "\n", "In this problem we will measure the accuracy of the approximations by using\n", "the percent error, as in previous homeworks. \n", "

\n", "

(A) Suppose of all the kids that show up on Halloween night, 58% are dressed in Spiderman costumes. If 60 kids show up, what is the probability that between 33 and 38 (inclusive) kids will be dressed in Spiderman costumes? (Use the binomial.)

\n", "

(B) Repeat the previous question, but using the normal approximation to the binomial, without using the continuity correction, and express the accuracy of your approximation using the percentage error.

\n", "

(C) Repeat the previous question, but now using the continuity correction, again showing the accuracy using the percentage error.
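
The three calculations above can be sketched on a separate example. This is only an illustration with hypothetical numbers (n = 40, p = 0.3, range 10 to 14 inclusive), not the values in this problem. It assumes percent error means 100 * |approximation - exact| / exact, and the names `lo`, `hi`, and `percent_error` are made up for the sketch.

```python
from math import sqrt
from scipy.stats import binom, norm

# Hypothetical parameters, NOT the ones from Problem Five
n, p = 40, 0.3
lo, hi = 10, 14                      # inclusive range of successes

mu = n * p                           # mean of the binomial
sigma = sqrt(n * p * (1 - p))        # standard deviation of the binomial

# Exact binomial probability: P(lo <= X <= hi) = F(hi) - F(lo - 1)
exact = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)

# Normal approximation without the continuity correction
plain = norm.cdf(hi, loc=mu, scale=sigma) - norm.cdf(lo, loc=mu, scale=sigma)

# Normal approximation with the continuity correction:
# extend the interval by 0.5 on each side
corrected = (norm.cdf(hi + 0.5, loc=mu, scale=sigma)
             - norm.cdf(lo - 0.5, loc=mu, scale=sigma))

def percent_error(approx, true_value):
    # Assumed definition of percent error (illustrative helper)
    return 100 * abs(approx - true_value) / true_value

print('exact     =', exact)
print('plain     =', plain, ' percent error =', percent_error(plain, exact))
print('corrected =', corrected, ' percent error =', percent_error(corrected, exact))
```

For these example numbers the corrected approximation lands much closer to the exact value than the uncorrected one.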

\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Six (CLT)\n", "\n", "

Suppose the heights of 3400 male students at a university are normally distributed with mean 68 inches and standard deviation 3 inches. That is, you have a random variable $X$ which uniformly at random chooses a male student and returns his height, where $E(X) = 68.0$ and $\\sigma_X = 3.0$.

\n", "

(A) Suppose sample groups of 25 men are taken from this population (with replacement)\n", " and the average height of each group is calculated, i.e., you are investigating the random variable $\\overline{X}_{25}$. What would be the expected value and standard deviation of $\\overline{X}_{25}$?

\n", "

(B) Suppose we want more accuracy in our sampling procedure, so that the standard deviation of the result is at most 0.25 inches. What is the smallest sample size we could use to ensure this? (Formally, what is the smallest $n$ for which $\\sigma_{\\overline{X}_n} \\le 0.25$?)

\n", "

(C) Suppose you take 80 samples of size 25 (80 \"pokes\" of $\\overline{X}_{25}$). In how many samples would you *expect* to find the output from $\\overline{X}_{25}$ between 66.8 and 68.3 inches? The expected value may be a floating-point number.
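
All three parts rest on the standard deviation of a sample mean, sigma / sqrt(n). Here is a rough sketch of that pattern on hypothetical numbers (population mean 100, standard deviation 15, samples of size 36, 50 samples), not the values above. Every variable name is illustrative.

```python
from math import sqrt, ceil
from scipy.stats import norm

# Hypothetical population parameters, NOT the ones from Problem Six
mu, sigma = 100.0, 15.0
n = 36

# The sample mean has the same expected value as the population,
# and standard deviation sigma / sqrt(n)
se = sigma / sqrt(n)

# Smallest n with sigma / sqrt(n) <= target, i.e. n >= (sigma / target)^2
target = 2.0
n_min = ceil((sigma / target) ** 2)

# Expected number of sample means (out of 50) falling in [97, 102]:
# 50 times the probability that one sample mean lands in that interval
p_in = norm.cdf(102, loc=mu, scale=se) - norm.cdf(97, loc=mu, scale=se)
expected_count = 50 * p_in

print('standard deviation of the sample mean:', se)
print('smallest sample size for target', target, ':', n_min)
print('expected number of samples in range:', expected_count)
```

Rounding up with `ceil` matters because a sample size must be a whole number and rounding down would leave the standard deviation above the target.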

\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Seven (Sampling Theory) \n", "\n", "This problem considers three different ways of answering a question about samples from an infinite population. Suppose you flip a fair coin 120 times. What is the probability that 75 or more of the flips will be heads?\n", "\n", "(A) First solve this problem precisely using the binomial, showing your formula (you'll need to use Python -- look at the end of the previous cell above to see some useful functions). \n", "\n", "(B) Next, solve the problem by using the normal approximation to the binomial (using the continuity correction). \n", "\n", "Finally, we will solve this as a problem in sampling: let each $X_i$ be a Bernoulli random variable (one coin flip) and let \n", "\n", "$$\\overline{X} = {(X_1 + \\cdots + X_{120})\\over 120},$$ so that by the CLT, $\\overline{X}\\sim N(\\mu,\\sigma^2)$\n", "for some mean $\\mu$ and variance $\\sigma^2$. \n", "\n", "(C) Give the mean $\\mu$ and variance $\\sigma^2$.\n", "\n", "(D) Now calculate the answer using the CLT, using the continuity correction\n", "(where effectively the bins have width 1/120, so you should adjust the boundary by\n", "one half the width of the bin), and give the percentage error. \n", "\n", "Hint: You should get the same answer for (B) and (D); they will be close to, but not the same as, the \"ground truth\" answer in (A). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Eight \n", "\n", "CANCELLED!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Nine (CLT and Sampling) \n", "\n", "Suppose you know that when there was a vote for the mayor of Cambridge, MA, 54.8% of the people voted for candidate A. 
\n", "\n", "(a) Supposing samples of size 30 are taken, what would you expect to be the mean and standard deviation of the sampling distribution of proportions? Give your result in terms of percentages.\n", "\n", "(b) Suppose we want more accuracy in our sampling procedure, so that the standard deviation of the sampling distribution of proportions is at most 5%. What is the smallest sample size we could use to ensure this?\n", "\n", "(c) Supposing you take 100 samples of size 30, in how many samples would you expect to find the proportion accurate to 1%, i.e., between 53.8% and 55.8%? (Don't worry about using the continuity correction here.)\n", "\n", "Hint: This is the same as Problem Six, but where the population represents the outcomes of a Bernoulli\n", "experiment with p = 0.548. In such cases, we do not need to be given the population standard deviation, because it is determined by a formula involving p (get out\n", "your \"cheatsheet\" from the midterm!). This is a special case called \"sampling with proportions.\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Ten \n", "\n", "CANCELLED! " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }